Method and apparatus for processing two or more initially decoded audio signals received or replayed from a bitstream

ABSTRACT

In the MPEG-4 standard ISO/IEC 14496:2001 several audio objects that can be coded with different MPEG-4 format coding types can together form a composed audio system representing a single soundtrack from the several audio substreams. In a receiver the multiple audio objects are decoded separately, but not directly played back to a listener. Instead, transmitted instructions for mixdown are used to prepare a single soundtrack. Mixdown conflicts can occur in case the audio signals to be combined have different channel numbers or configurations. According to the invention an additional audio channel configuration node is used that tags the correct channel configuration information items to the decoded audio data streams to be presented. The invention enables the content provider to set the channel configuration in such a way that the presenter at receiver side can produce a correct channel presentation under all circumstances. An escape code value in the channel configuration data facilitates correct handling of not yet defined channel combinations.

This application claims the benefit, under 35 U.S.C. §365 of International Application PCT/EP03/13172, filed Nov. 24, 2003, which was published in accordance with PCT Article 21(2) on Jun. 17, 2004 in English and which claims the benefit of European patent application No. 02026779.5, filed Dec. 2, 2002.

The invention relates to a method and to an apparatus for processing two or more initially decoded audio signals received or replayed from a bitstream, that each have a different number of channels and/or different channel configurations, and that are combined before being presented in a final channel configuration.

BACKGROUND

In the MPEG-4 standard ISO/IEC 14496:2001, in particular in part 3 Audio and in part 1 Systems, several audio objects that can be coded with different MPEG-4 format coding types can together form a composed audio system representing a single soundtrack from the several audio substreams. User interaction, terminal capability, and speaker configuration may be used when determining how to produce a single soundtrack from the component objects. Audio composition means mixing multiple individual audio objects to create a single soundtrack, e.g. a single channel or a single stereo pair. A set of instructions for mixdown is transmitted or transferred in the bitstream. In a receiver the multiple audio objects are decoded separately, but not directly played back to a listener. Instead, the transmitted instructions for mixdown are used to prepare a single soundtrack from the decoded audio objects. This final soundtrack is then played for the listener.

ISO/IEC 14496:2001 is the second version of the MPEG-4 Audio standard, whereas ISO/IEC 14496 is the first version. In the above MPEG-4 Audio standard nodes for presenting audio are described. Header streams that contain configuration information, which is necessary for decoding the audio substreams, are transported via MPEG-4 Systems. In a simple audio scene the channel configuration of the audio decoder (for example 5.1 multichannel) can be fed inside the Compositor from one node to the following node so that the channel configuration information can reach the presenter, which is responsible for the correct loudspeaker mapping. The presenter represents that final part of the audio chain which is no longer under the control of the broadcaster or content provider, e.g. an audio amplifier having volume control and the attached loudspeakers.

‘Node’ means a processing step or unit used in the above MPEG-4 standard, e.g. an interface carrying out time synchronisation between a decoder and subsequent processing units, or a corresponding interface between the presenter and an upstream processing unit. In general, in ISO/IEC 14496-1:2001 the scene description is represented using a parametric approach. The description consists of an encoded hierarchy or tree of nodes with attributes and other information including event sources and targets. Leaf nodes in this tree correspond to elementary audio-visual data, whereas intermediate nodes group this material to form audio-visual objects, and perform e.g. grouping and transformation on such audio-visual objects (scene description nodes).

Audio decoders either have a predetermined channel configuration by definition, or receive e.g. some configuration information items for setting their channel configuration.

INVENTION

Normally, in an audio processing tree the channel configuration of the audio decoders can be used for the loudspeaker mapping occurring after passing the sound node, see ISO/IEC 14496-3:2001, chapter 1.6.3.4 Channel Configuration. Therefore, as shown in FIG. 1, an MPEG-4 player implementation passes these information items, which are transmitted within a received MPEG-4 bitstream, together with the decoder output or outputs through the audio nodes AudioSource and Sound2D to the presenter. The channel configuration data ChannelConfig is to be used by the presenter to make the correct loudspeaker association, especially in case of multi-channel audio (numchan>1) where the phaseGroup flags in the audio nodes are to be set.

However, when combining or composing audio substreams having different channel assignments, e.g. 5.1 multichannel surround sound and 2.0 stereo, some of the audio nodes (AudioMix, AudioSwitch and AudioFX) defined in the current MPEG-4 standard mentioned above can change the fixed channel assignment that is required for the correct channel representation, i.e. such audio nodes have a channel-variant behaviour leading to conflicts in the channel configuration transmission.

A problem to be solved by the invention is to deal properly with such channel configuration conflicts such that the presenter can replay sound with the correct or the desired channel assignments. This problem is solved by the method disclosed in claim 1. An apparatus that utilises this method is disclosed in claim 3.

The invention discloses different but related ways of solving the channel configuration confusion caused by channel-variant audio nodes. An additional audio channel configuration node is used, or its functionality is added to the existing audio mixing and/or switching nodes. This additional audio channel configuration node tags the correct channel configuration information items to the decoded audio data streams that pass through the Sound2D node to the presenter.

Advantageously, the invention enables the content provider or broadcaster to set the channel configuration in such a way that the presenter at receiver side can produce a correct channel presentation under all circumstances. An escape code value in the channel configuration data facilitates correct handling of not yet defined channel combinations even in case signals having different channel configurations are mixed and/or switched together.

The invention can also be used in any other multi-channel application wherein the received channel data are passed through a post processing unit having the inherent ability to interchange the received channels at reproduction.

In principle, the inventive method is suited for processing two or more initially decoded audio signals received or replayed from a bitstream, that each have a different number of channels and/or different channel configurations, and that are combined by mixing and/or switching before being presented in a final channel configuration, wherein to each one of said initially decoded audio signals a corresponding specific channel configuration information is attached, and wherein said mixing and/or switching is controlled such that in case of non-matching number of channels and/or types of channel configurations the number and/or configuration of the channels to be output following said mixing and/or following said switching is determined by related specific mixing and/or switching information provided from a content provider or broadcaster, and wherein to the combined data stream to be presented a correspondingly updated channel configuration information is attached.
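The flow summarised above (tag each decoded signal, combine, attach an updated tag) can be sketched in Python. This is only an illustrative sketch, not the patented implementation: the names `TaggedSignal` and `combine`, the use of a single sample per channel, and the representation of the provider's information as one integer index are all assumptions made for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TaggedSignal:
    """A decoded signal with its channel configuration attached (hypothetical)."""
    samples: List[float]      # one sample per channel, for brevity
    channel_config: int       # assumed index identifying the channel layout


def combine(signals: List[TaggedSignal],
            mixdown: Callable[[List[List[float]]], List[float]],
            provider_config: Optional[int] = None) -> TaggedSignal:
    """Mix or switch the inputs and attach an updated configuration tag."""
    configs = {s.channel_config for s in signals}
    if provider_config is not None:
        # The content provider or broadcaster decides the output layout.
        out_config = provider_config
    elif len(configs) == 1:
        # All inputs agree, so their common tag can be kept.
        out_config = configs.pop()
    else:
        raise ValueError("conflicting channel configurations and no "
                         "provider-supplied configuration")
    out_samples = mixdown([s.samples for s in signals])
    return TaggedSignal(out_samples, out_config)
```

With non-matching inputs and no provider-supplied value the sketch refuses to guess, which mirrors the point that the conflict has to be resolved by information from the content provider or broadcaster.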

In principle the inventive apparatus includes:

- at least two audio data decoders that decode audio data received or replayed from a bitstream;
- means for processing the audio signals initially decoded by said audio data decoders, wherein at least two of said decoded audio signals each have a different number of channels and/or a different channel configuration, and wherein said processing includes combination by mixing and/or switching;
- means for presenting the combined audio signals in a final channel configuration, wherein to each one of said initially decoded audio signals a corresponding specific channel configuration information is attached,
- wherein in said processing means said mixing and/or switching is controlled such that in case of non-matching number of channels and/or types of channel configurations the number and/or configuration of the channels to be output following said mixing and/or following said switching is determined by related specific mixing and/or switching information provided from a content provider or broadcaster, and wherein to the combined data stream fed to said presenting means a correspondingly updated channel configuration information is attached.

Advantageous additional embodiments of the invention are disclosed in the respective dependent claims.

DRAWINGS

Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in:

FIG. 1 Transparent channel configuration information flow in a receiver;

FIG. 2 Channel configuration flow conflicts in a receiver;

FIG. 3 Inventive receiver including an additional node AudioChannelConfig.

EXEMPLARY EMBODIMENTS

In FIG. 2 a first decoder 21 provides a decoded ‘5.1 multichannel’ signal via an AudioSource node or interface 24 to a first input In1 of an AudioMix node or mixing stage 27. A second decoder 22 provides a decoded ‘2.0 stereo’ signal via an AudioSource node or interface 25 to a second input In2 of AudioMix node 27. The AudioMix node 27 represents a multichannel switch that allows any input channel or channels to be connected to any output channel or channels, whereby the effective amplification factors used can have any value between ‘0’=‘off’ and ‘1’=‘on’, e.g. ‘0.5’, ‘0.6’ or ‘0.707’. The output signal from AudioMix node 27 having a ‘5.1 multichannel’ format is fed to a first input of an AudioSwitch node or switcher or mixing stage 28. A third decoder 23 provides a decoded ‘1 (centre)’ signal via an AudioSource node or interface 26 to a second input of AudioSwitch node 28.

The functionality of this AudioSwitch node 28 is similar to that of the AudioMix node 27, except that the ‘amplification factors’ used therein can have values ‘0’=‘off’ or ‘1’=‘on’ only. AudioMix node 27 and AudioSwitch node 28 are controlled by a control unit or stage 278 that retrieves and/or evaluates from the bitstream received from a content provider or broadcaster e.g. channel configuration data and other data required in the nodes, and feeds these data items to the nodes. AudioSwitch node 28 produces or evaluates sequences of switching decisions related to the selection of which input channels are to be passed through as which output audio channels. The corresponding whichChoice data field specifies the corresponding channel selections versus time instants. The audio output signal from AudioSwitch node 28 having a ‘2.0 stereo’ format is passed via a Sound2D node or interface 29 to the input of a presenter or reproduction stage 20.

In FIG. 2 two different conflicts are shown. The first conflict occurs in the mix node 27, where a mix of a stereo signal into the surround channels in a 5.1 configuration is shown. The question is, for example, whether the resulting audio output signal should have 5.1 channels, or the 5.1 surround channels should become 2.0 stereo format channels. In case of selecting a 5.1 output format the straightforward solution would be to assign input signal L2 to the first output channel 1ch and input signal R2 to the second output channel 2ch. However, there are many other possibilities. The content provider or broadcaster could desire to assign input signal L2 to output channel 4ch and input signal R2 to output channel 5ch instead. However, the current version of the above MPEG-4 standard does not allow such a feature.
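The routing choice discussed above can be pictured as a gain matrix with one row per output channel and one column per input channel. The following Python sketch is a hypothetical illustration only: `routing_matrix` and `mix` are invented helper names, the input ordering (six 5.1 channels followed by L2 and R2) and the gain of 0.5 are assumptions, and one sample per channel stands in for a whole audio frame.

```python
def routing_matrix(n_out, n_in, routes):
    """Build an n_out x n_in gain matrix from {(out_idx, in_idx): gain}."""
    matrix = [[0.0] * n_in for _ in range(n_out)]
    for (out_idx, in_idx), gain in routes.items():
        if not 0.0 <= gain <= 1.0:
            raise ValueError("gain must lie between 0 (off) and 1 (on)")
        matrix[out_idx][in_idx] = gain
    return matrix


def mix(inputs, matrix):
    """Each output channel is the gain-weighted sum of all input channels."""
    return [sum(g * x for g, x in zip(row, inputs)) for row in matrix]


# Inputs 0..5: the six channels of the 5.1 signal; inputs 6, 7: L2 and R2.
routes = {(ch, ch): 1.0 for ch in range(6)}   # pass the 5.1 channels through
routes[(3, 6)] = 0.5                          # L2 -> output channel 4ch
routes[(4, 7)] = 0.5                          # R2 -> output channel 5ch
matrix = routing_matrix(6, 8, routes)

frame = [0.25, 0.5, 0.75, 0.125, 0.0, -0.5, 1.0, -1.0]
output = mix(frame, matrix)
```

Changing only the two `routes` entries for L2 and R2 moves the stereo signal to other output channels, which is the kind of decision the text leaves to the content provider or broadcaster.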

The second conflict occurs in the sequence of whichChoice data field updates in the AudioSwitch node 28. Within this sequence, channels out of the AudioMix node 27 output and the single channel output from AudioSource node 26 are sequentially selected at specified time instants. The time instants in the whichChoice data field can be defined by e.g. every succeeding frame or group of frames, every predetermined time period (for instance 5 minutes), each time the content provider or broadcaster has preset or commanded, or upon each mouse click of a user. In the example given in FIG. 2, at a first time instant input signal C1 is connected to output channel 1ch and input signal M is connected to output channel 2ch. At a second time instant input signal L1 is connected to output channel 1ch and input signal R1 is connected to output channel 2ch. At a third time instant input signal LS1 is connected to output channel 1ch and input signal RS1 is connected to output channel 2ch. However, because of the contradictory input information in node 28, no correct output channel configuration can be determined automatically based on the current version of the above MPEG-4 standard.
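The switching sequence of FIG. 2 can be written out as data to show why the output configuration is ambiguous. This is a hedged sketch: the signal names follow the text, but `SOURCE_FORMAT`, `WHICH_CHOICE`, `switch` and `source_formats` are hypothetical names, and the list-of-pairs layout is an assumption rather than the actual whichChoice encoding.

```python
# Format of the signal each named input channel belongs to (names as in FIG. 2).
SOURCE_FORMAT = {
    "L1": "5.1", "R1": "5.1", "C1": "5.1", "LS1": "5.1", "RS1": "5.1",
    "M": "1 (centre)",
}

# One entry per time instant: (input routed to 1ch, input routed to 2ch).
WHICH_CHOICE = [("C1", "M"), ("L1", "R1"), ("LS1", "RS1")]


def switch(samples, choice):
    """Pass the chosen input channels through with gain 1; all others are off."""
    return [samples[name] for name in choice]


def source_formats(choice):
    """Return the set of input formats that feed the output at one instant."""
    return {SOURCE_FORMAT[name] for name in choice}
```

At the first time instant the two output channels draw on both the 5.1 signal and the centre signal, while at the later instants they draw on different channel pairs of the 5.1 signal, so no single inherited configuration describes the stereo output.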

Based on the assumption that the content provider or broadcaster is to solve such conflicts, three inventive solutions are feasible that are explained in connection with FIG. 3. A first decoder 21 provides a decoded ‘5.1 multichannel’ signal via an AudioSource node or interface 24 to a first input of an AudioMix node or mixing stage 27. A second decoder 22 provides a decoded ‘2.0 stereo’ signal via an AudioSource node or interface 25 to a second input of AudioMix node 27. The output signal from AudioMix node 27 having a ‘5.1 multichannel’ format is fed to a first input of an AudioSwitch node or switcher or mixing stage 28. A third decoder 23 provides a decoded ‘1 (centre)’ signal via an AudioSource node or interface 26 to a second input of AudioSwitch node 28. The decoders may each include at the input an internal or external decoding buffer. The output signal from AudioSwitch node 28 having a ‘2.0 stereo’ format is passed via a Sound2D node or interface 29 to the input of a presenter or reproduction stage 20.

AudioMix node 27 and AudioSwitch node 28 are controlled by a control unit or stage 278 that retrieves and/or evaluates from the bitstream received from a content provider or broadcaster e.g. channel configuration data and other data required in the nodes, and feeds these data items to the nodes.

A new audio node, called AudioChannelConfig node 30, is introduced between AudioSwitch node 28 and Sound2D node 29. This node has the following properties or function:

AudioChannelConfig {
    exposedField SFInt32 numChannel       0
    exposedField MFInt32 phaseGroup       0
    exposedField MFInt32 channelConfig    0
    exposedField MFFloat channelLocation  0,0
    exposedField MFFloat channelDirection 0,0
    exposedField MFInt32 polarityPattern  1
},

expressed in the MPEG-4 notation. SFInt32, MFInt32 and MFFloat are single field (SF, containing a single value) and multiple field (MF, containing multiple values and the quantity of values) data types that are defined in ISO/IEC 14772-1:1998, subclause 5.2. ‘Int32’ means an integer number and ‘Float’ a floating point number. ‘exposedField’ denotes a data field the content of which can be changed by the content provider or broadcaster per audio scene.
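For readers more familiar with program code than with the MPEG-4 node notation, the node can be mirrored as a plain data structure. The Python dataclass below is only an analogy under stated assumptions: the SF type is modelled as a single `int`, the MF types as lists, and the defaults are copied from the node definition as given; it is not part of any MPEG-4 API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AudioChannelConfig:
    """Hypothetical Python mirror of the AudioChannelConfig node fields."""
    numChannel: int = 0                                              # SFInt32
    phaseGroup: List[int] = field(default_factory=lambda: [0])       # MFInt32
    channelConfig: List[int] = field(default_factory=lambda: [0])    # MFInt32
    channelLocation: List[float] = field(
        default_factory=lambda: [0.0, 0.0])                          # MFFloat
    channelDirection: List[float] = field(
        default_factory=lambda: [0.0, 0.0])                          # MFFloat
    polarityPattern: List[int] = field(default_factory=lambda: [1])  # MFInt32
```

Because every field is an exposedField, a content provider would overwrite these defaults per audio scene, e.g. `AudioChannelConfig(numChannel=2, channelConfig=[3])`.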

The phaseGroup (specifies phase relationships in the node output, i.e. specifies whether or not there are important phase relationships between multiple audio channels) and the numChannel (number of channels in the node output) fields are re-defined by the content provider due to the functional correlation with the channelConfig field or parameters.

The channelConfig field and the below channel configuration association table can be defined using a set of pre-defined index values, thereby using values from the ISO/IEC 14496-3:2001 audio part standard, chapter 1.6.3.4. According to the invention, it is extended using some values of chapter 0.2.3.2 of the MPEG-2 audio standard ISO/IEC 13818-3:

TABLE 1 Channel configuration association

index value | No. of channels | audio syntactic elements, listed in order received | Channel to speaker mapping
0 | unspecified | unspecified | channelConfiguration from child node is passed through
1 | - | Escape sequence | The channelLocation, channelDirection and polarityPattern fields are valid
2 | 1 | single_channel_element | centre front speaker
3 | 2 | channel_pair_element | left, right front speakers
4 | 3 | single_channel_element, channel_pair_element | centre front speaker, left, right front speakers
5 | 4 | single_channel_element, channel_pair_element, single_channel_element | centre front speaker, left, right centre front speakers, rear surround speakers
6 | 5 | single_channel_element, channel_pair_element, channel_pair_element | centre front speaker, left, right front speakers, left surround, right surround rear speakers
7 | 5 + 1 | single_channel_element, channel_pair_element, channel_pair_element, lfe_element | centre front speaker, left, right front speakers, left surround, right surround rear speakers, front low frequency effects speaker
8 | 7 + 1 | single_channel_element, channel_pair_element, channel_pair_element, channel_pair_element, lfe_element | centre front speaker, left, right centre front speakers, left, right outside front speakers, left surround, right surround rear speakers, front low frequency effects speaker
9 | 2/2 | MPEG-2 L, R, LS, RS | left, right front speakers, left surround, right surround rear speakers
10 | 2/1 | MPEG-2 L, R, S | left, right front speakers, rear surround speaker
. . .

Advantageously, an escape value is defined in this table, having e.g. index ‘1’. If this value occurs, the desired channel configuration is not listed in the table and therefore the values in the channelLocation, channelDirection and polarityPattern fields are to be used for assigning the desired channels and their properties. If the channelConfig index is another index defined in the table, the channelLocation, channelDirection and polarityPattern fields are vectors of length zero.
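How a presenter might branch on the channelConfig index can be sketched as a lookup with an escape path. The code below is a minimal sketch under assumptions: `resolve` and `SPEAKER_TABLE` are hypothetical names, only a few rows of Table 1 are reproduced, and the rule of one polarityPattern value per channel is an assumption that the text does not state explicitly.

```python
PASS_THROUGH = 0   # configuration from the child node is passed through
ESCAPE = 1         # explicit per-channel fields are valid

# Subset of Table 1: index value -> channel to speaker mapping.
SPEAKER_TABLE = {
    2: ["centre front"],
    3: ["left front", "right front"],
    9: ["left front", "right front", "left surround", "right surround"],
    10: ["left front", "right front", "rear surround"],
}


def resolve(index, child_mapping=None, location=(), direction=(), polarity=()):
    """Return one entry per channel describing where it is to be reproduced."""
    if index == PASS_THROUGH:
        return child_mapping
    if index == ESCAPE:
        if len(location) % 3 or len(location) != len(direction):
            raise ValueError("location/direction need 3 floats per channel")
        n = len(location) // 3
        if len(polarity) != n:
            # Assumption: exactly one polarityPattern value per channel.
            raise ValueError("one polarityPattern value per channel expected")
        return [
            {"location": tuple(location[3 * i:3 * i + 3]),
             "direction": tuple(direction[3 * i:3 * i + 3]),
             "polarity": polarity[i]}
            for i in range(n)
        ]
    if index in SPEAKER_TABLE:
        return SPEAKER_TABLE[index]
    raise ValueError(f"channelConfig index {index} is not defined")
```

The escape branch is what lets a layout missing from the table still reach the presenter with usable per-channel information.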

In the channelLocation and channelDirection fields a 3D-float vector array can be defined, whereby the first 3 float values (three-dimensional vector) are associated with the first channel, the next 3 float values are associated with the second channel, and so on.

The values are defined as x,y,z values (right handed coordinate system as used in ISO/IEC 14772-1 (VRML 97)). The channelLocation values describe the direction and the absolute distance in metres (the absolute distance is used because the user can easily generate a normalised vector from it, as usually used in channel configuration). The channelDirection is a unit vector with the same coordinate system. E.g. channelLocation [0, 0, −1] relative to the listening sweet spot means a centre speaker at a distance of one metre. Three examples are given in the three lines of table 2:

TABLE 2 Examples for channelLocation and channelDirection

channelLocation (X, Y, Z) | channelDirection (X, Y, Z) | Location
0, 0, −1 | 0, 0, 1 | center front speaker
k * sin(30°), 0, k * −cos(60°) | −sin(30°), 0, cos(60°) | right front speaker
k * −sin(45°), k * sin(45°), k * −cos(45°) | sin(45°), −sin(45°), cos(45°) | Ambisonic Cube (LFU) Left Front Up
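The layout of the channelLocation and channelDirection fields (three consecutive floats per channel, absolute distance in the location vector) can be illustrated with a short Python sketch. The helper names `split_vectors`, `distance` and `normalise` are hypothetical and not taken from any standard.

```python
import math


def split_vectors(flat):
    """Group a flat float list into one (x, y, z) tuple per channel."""
    if len(flat) % 3:
        raise ValueError("expected three floats (x, y, z) per channel")
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def distance(vec):
    """Absolute distance of a location vector from the listening position."""
    return math.sqrt(sum(c * c for c in vec))


def normalise(vec):
    """Derive the unit direction vector from an absolute location vector."""
    d = distance(vec)
    if d == 0.0:
        raise ValueError("zero-length vector has no direction")
    return tuple(c / d for c in vec)
```

For instance, `split_vectors([0, 0, -1, 3, 0, -4])` yields two channel vectors, and `normalise` applied to the second one gives the normalised form that the text says the user can generate from the absolute values.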

The polarityPattern is an integer vector where the values are restricted to the values given in table 3. This is useful for example in case of Dolby ProLogic sound where the front channels have a monopole pattern and the surround channels have a dipole characteristic.

The polarityPattern can have values according to table 3.

TABLE 3 polarityPattern association

Value | Characteristics
0 | Monopole
1 | Dipole
3 | Cardioid
4 | Headphone
. . . | . . .

In an alternative embodiment of the invention, the additional AudioChannelConfig node 30 is not inserted. Instead, the functionality of this node is added to nodes of the type AudioMix 27, AudioSwitch 28 and AudioFX (not depicted).

In a further alternative embodiment of the invention, the above values of the phaseGroup fields are additionally defined for the corresponding existing nodes AudioMix, AudioSwitch and AudioFX in the first version ISO/IEC 14496 of the MPEG-4 standard. This is a partial solution whereby the values for the phase groups are taken from above table 1 except the escape sequence. Higher values are reserved for private or future use. For example, channels having the phaseGroup 2 are identified as left/right front speakers.

1. Method for processing two or more decoded but not yet combined individual audio signals received or replayed from different audio sources, wherein at least two of said decoded audio signals have a different number of channels per decoded audio signal and different channel configurations for channel to speaker mapping, and wherein said decoded audio signals are to be combined by mixing and/or switching before being presented in a final channel configuration, and wherein to each one of said decoded audio signals a corresponding specific channel configuration information item representing corresponding number of channels and channel configuration for said each decoded audio signal is attached and the channel configuration information items for said two or more decoded audio signals can represent conflicting numbers of channels per decoded audio signal and conflicting channel configurations, said method comprising: controlling said mixing and/or switching such that in case of conflicting numbers of channels and conflicting channel configurations the number of the channels and the configuration of the channels to be output following said mixing and/or following said switching is determined by specific mixing and/or switching information provided from a content provider or broadcaster that is embedded in at least one of said audio signals, so as to resolve such conflict, and attaching to the combined data stream to be presented a correspondingly updated channel configuration information item, wherein said correspondingly updated channel configuration information item represents said determined number of channels and configuration.

2. Method according to claim 1, wherein said bitstream has MPEG-4 format.